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Abstract 

We present two new results for the problem of approximating a given real m x n matrix A 
by a rank-A: matrix D, where k < min{m, n}, so as to minimize \\A — -D||f^ It is known that by 
sampling 0(k/e) rows of the matrix, one can find a low-rank approximation with additive error 
e||A||p. Our first result shows that with adaptive sampling in t rounds and 0(k/e) samples in 
each round, the additive error drops exponentially as e*; the computation time is nearly linear in 
the number of nonzero entries. This demonstrates that multiple passes can be highly beneficial 
for a natural (and widely studied) algorithmic problem. Our second result is that there exists a 
subset of 0(k 2 /e) rows such that their span contains a rank-fc approximation with multiplicative 
(1 + e) error (i.e., the sum of squares distance has a small "core-set" whose span determines 
a good approximation). This existence theorem leads to a PTAS for the following projective 
clustering problem: Given a set of points P in R d , and integers k,j, find a set of j subspaces 
Fi,...,Fj, each of dimension at most k, that minimize X^ngp minj d{p, FA 2 . 

1 Introduction 

Given data consisting of points in high- dimensional space, it is often of interest to find a low- 
dimensional representation. In this paper, we consider the general problem of finding one or more 
(up to j) subspaces, each of dimension at most k, and representing each point by its orthogonal 
projection to the nearest subspace. Our goal will be to minimize the sum of squared distances of 
each point to its nearest subspace, a measure of the "error" incurred by this representation. 

This problem has been called projective clustering, since the j subspaces induce a partition of 
the point set. Algorithms and systems based on projective clustering have been applied to facial 
recognition, data-mining, and synthetic data [3 [251 [6], motivated by the observation that no single 
subspace performs as well as a few different subspaces. It should be noted that the advantage of a 
low-dimensional representation is not merely in the computational savings, but also the improved 



quality of retrieval. We discuss related theoretical work in Section 1.2. 

The case of j = 1, i.e., finding a single fe-dimensional subspace is an important problem in itself 
and can be solved efficiently (for j > 2, the problem is NP-hard [23], even for k = 1 |llj). Viewing 
the points as the rows of an m x n matrix A, we find the top k right singular vectors of this matrix 
via the Singular Value Decomposition (SVD). The projection itself is given by the rank k matrix 
Ak = AYY where the columns of Y are the top k right singular vectors of A. Note that among all 
rank k matrices D, A^ is the one that minimizes \\A — D\\^, = ^ AA^j — Dij) 2 . The running time of 
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this algorithm, dominated by the SVD computation, is 0(mm{mn 2 ,nm 2 }). Although polynomial, 
this is still too high for some applications. 

For problems on data sets that are too large or expensive to store/process in their entirety, one 
can view the data as a stream and the goal is to store/process a subset chosen judiciously on the 
fly and then extrapolate from this subset. Motivated by the question of finding a faster algorithm, 
Frieze et al. [TB] showed that any matrix A has a subset of k/e rows whose span contains an 
approximately optimal rank k approximation to A. In fact, the subset of rows can be obtained as 
independent samples from a distribution that depends only on the lengths of the rows. (In what 
follows, A^ l > denotes the iih row of A, as a column vector.) 

Theorem 1 ([16]). Let S be a sample of s rows of anmxn matrix A, each chosen independently 
from the following distribution: Row i is picked with probability 

\\A(i)\\ 2 
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If s > k/ce, then the span of S contains a matrix A k of rank at most k for which 

\\A - A k \\ 2 F < \\A - A k \\ 2 F + e\\A\\ 2 F . 

This can be turned into an efficient algorithm based on sampling [11] R The algorithm makes one 
pass through A to figure out the sampling distribution and another pass to sample and compute the 
approximation. Its complexity is 0(min{m, n}/c 2 /e 4 ). These results lead to the following questions: 
(1) Can the error be reduced significantly by using multiple passes through the data? (2) Can we 
get multiplicative (1 + e) approximations? (3) Do these sampling algorithms have any consequences 
for the general projective clustering problem? 

1.1 Our results 

Our first result is that the additive error term drops exponentially with the number of passes. Thus, 
low-rank approximation is a natural problem for which multiple passes through the data are highly 
beneficial. 

The idea behind the algorithm is quite simple. As an illustrative example, suppose the data 
consists of points along a 1-dimensional subspace of M n except for one point. The best rank 2 
subspace has zero error. However, one round of sampling will most likely miss the point far from 
the line. So we consider the following two-round approach. In the first pass, we get a sample 
and find a rank 2 approximation using it. Then we sample again, but this time with probability 
proportional to the error of the approximation. If the lone far-off point is missed in the first pass, 
it will have a very high probability of being chosen in the second pass. The span of the full sample 
now contains a very good rank 2 approximation. In the general theorem below, for a set of rows S 
of a matrix A, we denote by tts(A) the matrix whose rows are the projection of the rows of A to 
the span of S. 



1 Frieze et al. go further to show that there is an s x s submatrix for s — poly (k/e) from which the low-rank 
approximation can be computed in poly(fc, 1/e) time in an implicit form. 



Theorem 2. Let S = S\ U • • • U St be a random sample of rows of an m x n matrix A where 
for j = 1, . . . ,t, each set Sj is a sample of s rows of A chosen independently from the following 
distribution: row i is picked with probability 

ll pWll 2 
P P>c " 
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where E\ = A, Ej = A — ttsiU-us--i(A) and c is a constant. Then for s > k/ce, the span of S 
contains a matrix A k of rank k such that 

E S (\\A - A k \\ 2 F ) < _L-||A- A k \\ 2 F + e l \\A\\ 2 F . 

The proof of Theorem [2] is given in Section [2j The resulting algorithm, described in Section [3] 
uses 2t passes through the data and 0(Mst + (m + n)s 2 t 2 ) computation time where M is the 
number of nonzeros in A. Although the sampling distribution is modified t times, the matrix itself 
is not changed and so its sparsity is maintained. The algorithm fits the streaming model in that 



the entries of A can arrive in any order (see Section 1.2). The space used is 0((m + n)kt/e). 

Theorem ^implies that for any matrix A, there exists a subset of kt/e rows whose span contains 
a rank-fc matrix whose error is within an additive e 4 ||^4||p of the best rank-A: matrix. Can this be 
improved? In particular, is there a small subset of rows whose span contains a i&nk-k matrix whose 
error is within a (1 + e) multiplicative factor of the error of the best possible rank-/c approximation? 
Our next theorem answers this question affirmatively. 

Theorem 3. For any matrix A, there exists a subset of Ak 2 /e rows in whose span lies a rank-k 
matrix A k such that 

\\A-A k \\ 2 F <(l + e)\\A-A k \\ 2 F . 

The proof of this theorem also uses iterative sampling, albeit in a "backwards" manner and 
yields a simple sampling-based algorithm for finding such an approximation only for k = 1. The 
proof for general k is by induction on k, and uses the sampling algorithm for k = 1 to extend 
the (approximately) best (k — l)-dimensional subspace to an approximately best fc-dimensional 
subspace. Although this existence result does not imply an algorithm faster than the SVD for 
finding such an approximation, it will be the key ingredient in our last result — a polynomial-time 
approximation scheme (PTAS) for the general projective clustering problem (j > 2). 

We restate the problem using the notation from computational geometry: Let d(p, F) be the 
orthogonal distance of a point p to a subspace F. Given a set of n points P in IR rf , find a set of j 
fc-dimensional subspaces Fi,...,Fj such that 

C({F 1 ...F j }) = y^mmd&Fi) 2 
peP 

is minimized. When subspaces are replaced by flats, the case k = corresponds to the j-means 
problem (with the sum of squares objective function). 

Theorem [3] suggests an enumerative algorithm. The optimal set of fc-dimensional subspaces 
induces a partition Pi, ... ,Pj of the given point set. In each set Pi, there is, by Theorem [3l a 
subset of size 0(k 2 /e) in whose span lies a (1 + e) approximation to the optimal ^-dimensional 
subspace for this set Pi. So we consider all possible combinations of j subsets each of size 0(k 2 /e), 



and a <5-net of /c-dimensional subspaces in the span of each subset. The 8-net depends on the points 
in each subset and is not just a grid, as is often the case. Each possible combination of subspaces 
induces a partition and we simply output the best. Since the subset size is bounded (and so is the 
size of the net), this gives a PTAS for the problem (see Section^). 

Theorem 4. Given n points in R and parameters B and e, in time 

/n\0(jk 3 /e) 

we can find a solution to the projective clustering problem which is of cost at most (l + e)B provided 
there is a solution of cost B. 

Our technique can also be viewed as an extension of the idea of core-sets. Roughly stated, 
a core-set is a small subset of the data which captures a near-optimal solution to the entire set. 
Theorem K^ states that there is a core-set of size 0(k 2 /e) for minimizing the squared error to a 
rank k subspace. Unlike proofs of core-sets for other problems, our proof relies on the probabilistic 
method along with the properties of the SVD. Finally, our result can be extended to finding affine 
subspaces instead of linear subspaces. 

1.2 Related work 

Following the work of [16] and [H] which introduced matrix sampling for fast low-rank approxima- 
tion, Achlioptas and McSherry [I] gave an alternative sampling-based algorithm for the problem. 
Their algorithm achieves similar bounds (see [T] for a detailed comparison) using only one pass. It 
does not seem amenable to the multipass improvements presented here. Subsequently, Bar-Yossef 
[S] has shown that the bounds of these algorithms for one or two passes are optimal up to polynomial 
factors in 1/e. 

These algorithms can also be viewed in the streaming model of computation [20] . In this model, 
we do not have random access to data; the data comes as a stream and we are allowed one or 
a few sequential passes over the data. Algorithms for the streaming model have been designed 
for computing frequency moments [7], histograms |17] . etc. and have mainly focused on what 
can be done in one pass. There has been some recent work on what can be done in multiple 
passes [121 [15] . The "pass-efficient" model of computation was introduced in [20]. Our multipass 
algorithm fits this model and investigates the tradeoff between approximation and the number of 
passes. Feigenbaum, et. al [15] show such a tradeoff for computing the maximum unweighted 
matching in bipartite graphs. 

The results of our paper connect two previously separate fields — low-rank approximation and 
projective clustering. As mentioned earlier, projective clustering has been used in various contexts 
[5] [25j [6]. In [3], the authors consider the same problem as in this paper, and propose a variant of 
the j-means algorithm for it. Their paper has promising experimental results but does not provide 
any theoretical guarantees. There are theoretical results for special cases of projective clustering, 
especially the j-means problem [k = 0, find j 0-dimensional affine subspaces, i.e., points). Drineas 
et al. [TT] gave a 2-approximation to j-means using SVD. Subsequently, Ostrovsky and Rabani 
|24] gave the first randomized polynomial time approximation schemes for j-means (and also the 
j-median problem). Matousek [22] and Effros and Schulman p3] both gave deterministic PTAS's 
for j-means. Fernandez de la Vega et al. [TO] describe a randomized algorithm with a running time 
of 0(ra(logn) '^).z Using the idea of core-sets, Har-Peled and Mazumdar [18] showed a (1 + e) 
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approximation algorithm that runs in linear time for fixed j, e. Kumar et al. [21] give a linear- 
time PTAS that uses random sampling. There is a PTAS for k = 1 (lines) as well [2]. Other 
objective functions have also been studied, e.g. sum of distances (j- median when k = 0, [2H [18]) 
and maximum distance (j-center when k = 0, [8]). For general k, Har-Peled and Varadarajan |19j 
give a (1 + e) approximation algorithm for the maximum distance objective. Their algorithm runs 
in time dn°^ k "^(V 6 )/ 6 ) a nd is based on core-sets (see [3] for a survey). 

1.3 Notation and Preliminaries 

Let A G W nxn . Let A^ 1 ' denote the ith row of A, seen as a column vector. Any matrix accepts a 
singular value decomposition, that is, it can be written in the form 

A = J2^ {t) v (l)T 

i=l 

where r is the rank of A and o\ > a% > • • • > <r n > are called the singular values; {v> x > , . . . , u^ r '} G 
M m , {wW, . . . , v ( r >} G M n are sets of orthonormal vectors, called the left and right singular vectors, 
respectively. It follows that A vS 1 ' = UiV^"' and Av^ 1 ' = aiu^ 1 ' for 1 < i < r. 

The Frobenius norm of a matrix A G M" 1X?1 having elements (a^-) is denoted ||^4||f and is given 
by 



\A\\ 



i 



m n 



It satisfies ||.A[||> = X^[=i °f • 

For a subspace V C M n , let irv,k(A) denote the best rank-/c approximation (under the Frobenius 

norm) of A with its rows in V. Let irk{A) = 7Tr« fe(A) = X^«=i giu^'v^ 1 ' be the best rank-A; 
approximation of A Also 7ry (A) = 7ry )ra (^4) is the orthogonal projection of A onto V. When we say 
"a set (or sample) of rows of A" we mean a set of indices of rows, rather than the actual rows. For 
a set S of rows of A, let span(S') C W 1 be the subspace generated those rows; we use the simplified 
notation 7r s (A) for 7r span(5) (A) and Tr S ,k( A ) for Kspaa(S),k( A )- 

For subspaces V, W C M n , the sum of them is denoted V + W and is given by 

V + W = {x + y£R n : x G V, y G W}. 

The following elementary properties of the operator ny will be used: 

• ny is linear, that is, ir v (\A+B) = \ir v (A)+Tr v (B) for any A G R and matrices ^4, i3 G M mxn . 

• If V, VF G W 1 are orthogonal linear subspaces, then nv+w(A) = ttv(A) + -kw(A), for any 
matrix iel raxn . 

For a random vector u, its expectation, denoted E(V), is the vector having as components the 
expected values of the components of v. 



2 A random sample contains a good approximation 

We will prove Theorem[2]in this section. It will be convenient to formulate an intermediate theorem 
as follows. 

Theorem 5. Let A G M mxn . Let V C W 1 be a vector subspace. Let E = A — Try {A). For a fixed 
c£l, let S be a random sample of s rows of A from a distribution such that row i is chosen with 
probability 

||£«||2 
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(1) 



Then, for any nonnegative integer k, 



E S (\\A - 7T V+span(slk (A)\\ 2 F ) < \\A - n k (A)f F + -\\E\\ 2 F . 



cs 



Proof. For S = (ri)j =1 a sample of rows of A and 1 < j < r, let 
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Then, E s (w^) = tt v (A) t u^ + E t u^ = ajv®. Now we will bound E s (\\w^ -(TjV^\\ 2 ). Use the 
definition of w^' to get 
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Apply the norm squared to each side and expand the left hand side: 
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Observe that 



which implies that 



E< 



u< _ 
P~, 



U) 



m CO 
— u) 



-£(*) =Y,Pi\E®=E T u^\ 



i=l 



2 A u U) 



E s [-^2 jj-E™ ■ (E T u^) = 2\\E T u^f. 



t=l 



Using this, apply Eg to Equation ^ to get: 

E s (\\w ij) -ajv ij) \\ 2 ) =E S 
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Now, from the left hand side, and expanding the norm squared, 

2 N 



E 



5' 



1 A ui j) 



V^ u ^ E (r i 



& ■ i ± Ti 

1=1 l 



P 



i=i \ 

2 v-^ 

+ ~2 E * 



V)E<r4\ 



Pi 



+ 



'u%>EM u^E^y 



Ki<l<s 



P 



P, 



(5) 



where 
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and, using the independence of the t-j's and Equation ([3]), 
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The substitution of Equations ([6| and ([7| in ([5]) gives 
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Using this in Equation Q we have 



1 m IU/^')p(*)l|2 1 

E 5 ([kW -^|| 2 ) = i^ K f I! _ I||^ u0 -)|| 2 , 
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and, using the hypothesis for Pj (Equation (111)), remembering that u^> is a unit vector and dis- 
carding the second term we conclude 



EsfWw® - <7jv®f) < — \\E\\ 2 F . 

cs 



(8) 



Let yV' = —w^' for j = l,...,r, let k' = min{A;,r} (think of k! as equal to k, this is the 

interesting case), let W = spanjy' 1 ^, . . . , y( '}, and F = -AX^t=i v ^ • We will bound the error 
1 1 -A — 7rw{A) \\p using F. Observe that the row space of F is contained in W and ttw is the projection 
operator onto the subspace of all matrices with row space in W with respect to the Frobenius norm. 
Thus, 



\A-7r w (A)\\ 2 F <\\A-F\\ F . 



Moreover, 



k' 



\A-Ff F = J2\\(A-Ffu^f = J2\ 
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Taking expectation and using p| we get 



k k 

-^H-'-^IIf;^ E a t + -\\ E \\F = \\A-TT k (A)\\ 2 F + -\\E\\ F . 
^-^ cs cs 

i=k+l 
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This and Equation (l9|) give 



E S (\\A - tt w (A)\\ 2 f ) < \\A - 7r k (A)\\ 2 F + -\\E\\ 2 F . 

cs 



(11) 



Finally, the fact that W C V + span(S') and dim(VF) < k imply that 

\\A - ^v +span( 5),fe(^)llF < \\ A ~ *w(A)\\ 2 F , 



and, combining this with Equation (11), we conclude 



E S (\\A - 7r v+spMS)tk (A)\\ 2 F ) < \\A - n k (A)\\ 2 F + ^\\E\\ 2 F . 



We can now prove Theorem [2] inductively using Theorem [5] 
Proof, (of TheoremUfy. We will prove the slightly stronger result 



□ 



E S (\\A - n s , k (A)f F ) < \ 4 }t \\A - 7r k (A)\\ 2 F + ( ±) \\A\ 



' -^-\\A--n k {A)\\ 2 . 

1 — — \cs 

cs v 

by induction on t. The case t = 1 is precisely Theorem [T] 

For the inductive step, let E = A — nsiu— uSt-i (-^) ■ ^y means of Theorem p| we have that, 

E St (||A-7r SlU ... u5e , fc (A)|||.) < ||A-7r fc (A)|||. + -||^|||. 

cs 

Combining this inequality with the fact that ||i?||^ < \\A — ^SiU—uSt-i,k(A)\\ F we get 

E St (||A - n SlU ... uSt ,k(A)\\ 2 F ) < \\A - n k (A)\\ 2 F + -\\A - ir Sl u...us t -i,k( A )\\F- 
Taking the expectation over Si, ... , St-i- 

%(P - 7r SlU ...us t ,fc(A)|||.) < \\A - n k (A)\\ 2 F + ^1^,...,^ (\\A - Tr Sl u-us t ^,k(A)\\ 2 F ) 
and the result follows from the induction hypothesis for t—1. □ 

3 Algorithm 

In this section, we present the multipass algorithm for low-rank approximation. We first describe 
it at a conceptual level and then give the details of the implementation. 

Informally, the algorithm will find an approximation to the best rank-A; subspace (the span 
of v ' 1 \ . . . , v ' ' by first choosing a sample T of s random rows with density proportional to the 
squared norm of each row (as in Theorem IT]). Then we focus ourselves on the space orthogonal to 
the span of the chosen rows, that is, we consider the matrix E = A — itt{A), which represents in 
some sense the error of our current approximation, and we sample s additional rows with density 
proportional to the squared norm of the rows of E. We consider the union of this sample with our 
previous sample, and we continue adding samples in this way, up to the number of passes that we 
have chosen. Theorem [2] gives a bound on the error of this procedure. 



Fast SVD 

Input: A G M mxn , integers k <m, t, error parameter e > 0. 

Output: hi . . . , h k £ M n such that with probability at least 3/4 their span V satisfies 

\\A - MA)\\ 2 f < (l + ^~) \\A - n k (A)f F + 4e*||A|| 2 F . (12) 

1. Let 5 = 0, s = k/e. 

2. Repeat t times: 

(a) Let E = A-tt s (A). 

(b) Let T be a sample of s rows of A according to the distribution that assigns probability 

||g(')|| 2 



WMf 



to row i. 



(c) Let S = S U T. 
3. Let hi, . . ■ , /ifc be the top k right singular vectors of irs(A). 



Let M be the number of non-zeros of A. 
Theorem 6. Algorithm Fast SVD is correct and has running time O [M— + (m + n)—^ 



Proof. For the correctness, observe that ttv(A) is a random variable with the same distribution as 
TTS,k(A) as defined in Theorem Also, \\A — 7rg t fc(A)|||i — ||^4 — 7Tfe(^4)|||. is a nonnegative random 
variable and Theorem |2] gives a bound on its expectation: 

Es(P - 7r Slfc (A)|||. - ||A - 7r fe (A)|||) < ^P - 7r fe (A)||| + e*||A|||.. 
Markov's inequality applied to this variable gives that with probability at least 3/4 
\\A - 7rv(A)\\ 2 F - \\A - 7T k (A)\\ 2 F <-^-\\A- n k (A)\\ 2 F + 4<?\\A\\ 2 F . 



which implies inequality (12). 



We will now bound the running time. We maintain a basis of the rows indexed by S. In 
each iteration, we extend this basis orthogonally with a new set of vectors Y, so that it spans 
the new sample T. The residual squared length of each row, ||.EW|| 2 , as well as the total, ||^||p>, 
are computed by subtracting the contribution of ttt(A) from the values that they had during the 
previous iteration. In each iteration, the projection onto Y needed for computing this contribution 
takes time O(Ms). In iteration i, the computation of the orthonormal basis Y takes time 0(ns 2 i) 
(Gram-Schmidt orthonormalization of s vectors in IR n against an orthonormal basis of size at most 
s(i+l)). Thus, the total time in iteration i is 0(Ms+ns 2 i); with t iterations, this is 0(Mst+ns 2 t 2 ). 
At the end of Step pq we have tts{A) in terms of our basis (an m x st matrix). Finding the top 
A; singular vectors in Step ^ takes time 0(ms 2 t 2 ). Bringing them back to the original basis takes 
time 0(nkst). Thus, the total running time is 0(Mst + ns 2 t 2 + ms 2 t 2 + nkst) or, in other words, 
o(Mf + (m + n)^Y □ 



4 Existence of a subset with multiplicative (1 + e) error 

In this section, we will prove Theorem [3J The first step is to show that for any matrix A, there is 
a row a such that the span of a is a factor 2 approximation to the best rank-1 subspace. 

Lemma 7. In any matrix A, there is a row a such that 

\\A - 7T spMa) (A)\\ 2 F < 2\\A - 7Tl(A)|||. 

Proof. As in the proof of Theorem pH let b be (the index of) a random row of A picked according 
to the distribution that assigns probability Pi = \\A^'\\ 2 /\\A\\ F to row i. Define the random vector 



w 



„(l) 



where u^ 1 ' is the top left singular vector of A. Then, we have K(w) = aiv"' and, by expanding 
\\w — aiv^'ll , we also have 



E ( \\w — o\v 



(i) II 2 
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and, writing the expectation as a sum and using the definition of Pb, 



£iw 
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\A\\ F -a 1 
\A-^{A)f F . 



Therefore, as in Equations ([9]) and (10), 



E(\\A-7r spMb) (A)\\ F )<E{\\w-a 1 v^\\ 2 ) + £>f < 2\\A-^i{A)f F . 

%=2 
and hence there exists a row that proves the lemma. □ 

Proof, (of Theorem pi) We will prove by induction on k that for integers s, k and A 6 ]j mxn there 
exists a subset 5 of rows of A of size (s + l)k such that 

||A-7r Sifc (A)|||.<(l + ^ \\A-7T k (A)\\ F . 

By Lemma [7j there is a row 6 of A such that 

||A-7r 8pan(6) (A)|||.<2||A-7r 1 (A)|||.. 
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For k = 1, apply Theorem p\ with V = span (b); we get that there is a set S 1 of s indices of rows 
of A such that 

\\A-ir Su{b}il (A)\\ 2 F < ||A-7ri(A)|||. + -||A-7r Bpan(6) (A)|||, 

2\ |2 



< II + -)\\A-n 1 (A)\\ F . 

Now, to extend this to higher k, one might consider the approach of finding such a sample to 
approximate the best rank 1 subspace, projecting orthogonal to it and repeating. This does not 
work. The reason is that the error incurred by a higher rank approximation could be much smaller: 
an e multiplicative error at an earlier stage (e.g., for the first stage, multiplicative with respect to 
\\A — TTi(A)\\p, the best rank-1 approximation) is already too large (given that at the end we want 
a multiplicative error with respect to \\A — 7Tf.(A)\\p, the best rank-fe approximation). 

Instead, for proving the inductive step, we will use the sampling idea backwards. In our context, 
finding a rank-(fc + 1) approximation to A is equivalent to finding a (k + l)-dimensional subspace 
V onto which to project, so that iry {A) is the approximation. If we think of our problem as finding 
such a subspace, then the quality of subspace V is given by \\A — 7iy(.A)|||i. We want to find a 
{k + l)-dimensional subspace in the span of (s + l)(k + 1) rows of A that is worse than the best 
possible only by a multiplicative factor of (1 + 2/s) .We will first show the existence of s + 1 rows 
such that if we replace v^' in the best rank-/c + 1 subspace (generated by i/ 1 ', . . . ,v^ k+l ') with a 
vector in the span of those rows, then the resulting k + 1-dimensional subspace is worse that the 
best by at most a (1 + 2/s) multiplicative factor. Then we will focus on our problem projected 
onto the orthogonal complement of span(6) to get from the inductive hypothesis that the best k- 
dimensional subspace it this restricted problem (which happens to be generated by v^ 2 \ . . . , y( k+1 >) 
can be replaced by another fc-dimensional subspace in the span of (s + l)k rows, which is worse that 
the best by a multiplicative factor of at most (1 + 2/s) . The combination of these two replacements 
will give the desired result. 

Suppose inductively that for some k and any matrix A and 1 < k' < k, there exists a subset of 
rows of A of size (s + l)k' such that 

1 + ^) \\A-n k/ (A)\\ F . 

To prove this for k + 1, let V be the optimal rank- (A: + 1) subspace. Let V be the subspace of V 
orthogonal to v^- 1 '. Project the rows of A orthogonal to V to get a matrix C, i.e., 

C = A-ir v ,{A). 

Applying the hypothesis with k' = 1 to C, we get that there exists a vector b in the span of a set 
S" of s + 1 rows of C such that 

IIC - 7r spaa(6) (C)||| < (l + ^) \\C ~ 7ri(C)||! = (l + ^) \\C ~ ir Bpan{v m)(C)\\ F . (13) 

Notice that V and span(6) are orthogonal subspaces, this implies that 

|| A - TTV'+spwa.(b)(A)\\ F = \\A-7T V/ (A) - 7r span ( b )(A)|||. 
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This combined with the fact that 7r sp a,n(b)(A) = 7r span ( fe )(C) and the definition of C gives 

\\A — 7iy+span(&)(^)llF = \\C — VT span ( b )(C) \\ F . 



We can now apply Equation (13) to get 

\\A-TT v , +span ( b) (A)\\ 2 F < (1 + -) ll C '- 7r S pan(«( 1 ))( C ')ll|- 

Use the definition of C again and the fact that 7r span / u (i)\(C) = ^ spa , n ( v m)(A) to conclude 



,(bM)\\ 2 F <( 1 + l) \\ A ~ *v>(A) - 7r span{vW) (A)\\] 



l + ^)||,l-^-(--»)||7,. 



(14) 



Now project A to the subspace orthogonal to b to get a matrix A' = A — vr span (b) (A) . Applying 
the inductive hypothesis to A' and k' = k, we get a set S' of (s + l)k rows such that (remember 
that V is /c-dimensional) 



\A>-ir s , >k (A>)\\ F <(l+*) \\A'-n k (A')\\ F <(l + 2 ^j \\A f - ir v ,(A>)\\l 



(15) 



The combination of the rows that we got from the two applications of the induction hypothesis 
gives us a set S = S' U S" of (s + l)(k + 1) rows of A, and we will see now that this set proves the 
induction for k + 1. 

Note that 7r S p a n(b)(-<4) — ^S',k(A') is a matrix of rank at most k + 1 whose row-space is contained 
in span (5). This implies 

\\A - Tr s ,k+i(A)\\ F < \\A - 7r span{b) (A) - ns>, k ( A ')\\ 2 F, 

using the definition of A': 

= \\A'-7r s ,^A')\\ 2 F , 



and using Equation (15): 



< 



1 + ^J \\A'-n vl (A')\\ F 



iFiom this, the fact that 7iy/ as a projection satisfies \\A' — iryi(A')\\ F < \\A' — B\\ F for any matrix 
B G ]g mxn having row-space in V gives us: 

\\A - 7v s , k+1 (A)f F < (l + ?\ \\A' - 7r v iA)\\ F , 

using the definition of A' again: 

< ( 1 + -J \\A - TT span{b) (A) - 7T V ,(A)\\ F , 
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using the orthogonality of span(6) and V: 

= (l + -J \\A - n v , +span{b) (A)f F , 



and using Equation (14), 



H) 



fe+i 

2 



< 1 + - M-ttk(A)|| 



This finishes the induction. The choice of s = — gives us the theorem. □ 

5 Application: projective clustering 

In this section, we give an approximation algorithm for the projective clustering problem described 
in Section |1.1| The algorithm is motivated by Theorem [3] Let V\ , . . . , Vj be the optimal subspaces 
partitioning the point set into P±, . . . ,Pj, where Pi is the subset of points closest to V{. Theorem [3] 
states that there exists a subset P\ C Pi of size 4k 2 /e in whose span lies an approximately optimal 
fc-dimensional subspace W{. We can enumerate over all combinations of j subsets, each of size 
4fc 2 /e to find the Pi, but we cannot enumerate the infinitely many fc-dimensional subspaces lying 
in the span of P\. 

Thus, we cannot hope to find Wi given JP$. A natural approach to solve this problem would be 
to put a finite grid on the points in the span of Pi. The hope would be that there are k grid points 
<7i, . . . , gt whose span G is "close" to Wi, since each basis vector for Wi is close to a grid point. 
Although G may be "close" to Wi, there may be a point p G Pi for which d{p,G) 2 3> d{p,Wi) 2 , 
i.e. G incurs a much greater error than Wi. Indeed, consider a point p G Pi whose distance to the 
origin is very large. Although the distance between a basis vector and a grid point might be small, 
the error induced by projecting a point onto the grid point will be proportional to its distance to 
the origin. 

The problem described above implies that a grid construction must be dependent on the point 
set Pi. Our grid construction does depend on Pi; we consider grid points in balls around a subset 
of Pi . The grid is laid down in the span of Pi (actually, a subspace of slightly higher dimension) to 
reduce the size of the grid, which is exponential in the dimension of the points. In Lemma |8j we 
show that there is a subspace Fi that is the span of k grid points such that ^2„ & p. d(p, Fi) 2 is not 
much worse than ^2 p€ p- d(p, Wi) 2 . This construction avoids the problem that points far away from 
the origin will have error proportional to their distance to the origin, since there are grid points 
close to such points. Although we find Fi in the span of Pi, it can be brought back up to Mr, where 
its error with respect to Wi is the same as in Pi. This is because Wi lies in the span of Pi. 

The algorithm is given below. 
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Algorithm Cluster 

Input: P C M. d , error parameter < e < 1, and B. 

Output: A set of j fc-dimensional subspaces F\ . . .Fj, such that 

C{{F l ...F j }) = Y^mmd{p,F i ) 2 
peP 

is at most (1 + e)B provided a solution of cost B exists. 

1. Set 5 = g*pL= ,R = J(l + ^)B + 25k. 

2. For each subset T of P of size 8jk 2 /e + jfc: 

(a) For each equipartition of {T\ . . . Tj) of T into j parts: 

i. Construct a 5-net Z?i with radius R for each T, in the span of Tj. 
ii. For each way of choosing j subspaces Fi,...,Fj, where Fj is the span of k 
points from Df. 
A. Compute the cost C({Fi . . . Fj}). 

3. Report the subspaces F\ . . . Fj of minimum cost C({F\ . . . Fj}). 



In Step 2(a)i we construct a 5-net Di. A 8-n.et D with radius R for a point set S" is a set of 
points such that for any point q within distance R of some p G S, there exists a g £ D such that 
d(q, g) < 5. To construct a 5-net D for a point set S with radius -R, we simply put a box of side 
length 2R around each point p G 5". Each point at grid length 5 /yd in each box is in £>, where <i is 
the dimension of the points in S. The size of the 5-net is thus exponential in the dimension of the 
points in S. This is why it is crucial that we constuct the 5- net for Tj in the span of Tf, if we were 
to simply create a 5-net for Ti in the original R space, the size of the net would be exponential in 



d\ By constructing the 5- net Di in the span of Ti in Step 2(a)i the size of Di is instead exponential 
in just 8A; 2 /e + k, the number of points in Tj. 

The correctness of the algorithm relies crucially on the next lemma. 

Lemma 8. Let P be a point set, and let W be a subspace of dimension k such that 

Y,d(p,W) 2 <a. 
p&p 

There exists a set CCP of k points such that there is a subspace F with the following properties: 

1. F is the span of k points from a 5-net D with radius \fa. + 25k for C . 

2. F is not too far from W: 

Y, d (p> F ) 2 ^ 2 d ( p > w "> 2 + 4k2n62 + 4k6 2 d ( p ' w ">- ( 16 ) 

p&p peP peP 
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Proof. We simultaneously construct F and choose the k points Pi,---,Pk in C in k steps. Let 
Fq = W. Inductively, in step i, we choose a point pi to put in C and rotate Fi-\ so that it includes 
a grid point gi around pj. The subspace resulting from the last rotation, Fj~, is the subspace F with 



the bound promised by the lemma. To prove that (16) holds, we prove the following inequality for 
any point p £ P going from i^_i to F% 

d(p,F i )<d(p,F i „ 1 ) + 28. (17) 

Summing over the k steps, squaring, and summing over n points, we have the desired result. 

Let G\ = {0}. Gi will be the span of the grid points {51,52, • • • , 9i-i}- We describe how to 
construct the rotation R{. Let pi £ P maximize 

IKj+CTFi-iCP))!! 
and let gi £ D minimize 

rf(7Ti!i_l(Pt),5i)- 

Put pi in C. Consider the plane Z defined by 7T G ±(gi),Tr G ±('K Fi _ 1 {pi)), and 0. Let 6 be the angle 
between Tr G ±(gi) and 7r G ±(ir Fi _ 1 (pi)). Let Ri be the rotation in the plane Z by the angle 6, and 
define F{ = i?jFj_i. Set Gj+i = Gi + span{#j}. 



Now we prove inequality (17). We do so by proving the following inequality by induction on i 
for any point p: 

d{7t Fi _ 1 {p),R l n Fi _ 1 {p))<25. (18) 



Note that this proves (17) by applying the triangle inequality, since: 



d(p,Fi) < d(p,F i -. 1 ) + d(ir Fl _ 1 (p),ir Fi (p)) 

< d(p,Fi-.i) + d(-K Ft _ 1 (p),R i it Fi _ 1 (p)). 

The base case of the inequality, i = 1, is trivial. Consider the inductive case; here, we are bounding 
the distance between n Fi _ 1 (p) and RiTT Fi _ 1 (p). It suffices to bound the distance between these two 
points in the subspace orthogonal to Gi, since the rotation Ri is chosen orthogonal to Gi. That is, 

d{TT Fl _ 1 (p),R l TT Fi _ 1 ( P )) < d(7T G ,x(7T Fi _ 1 (p)),i? 4 7r G x(7r Fi _ 1 (p))). 

Now, consider the distance between a point "Kq\.{is f ._ x ($))) and its rotation, RiTTQ±(ir Fi _ 1 (p)) . This 

i i 

distance is maximized when ||7r G j_(7r_p i _ 1 (p))| | is maximized, so we have, by construction, that the 
maximum value is achieved by pf 

d(7r G ±(Tv Fi _ 1 (p)),R i Tr G ±(7v Fi _ 1 (p))) < d{n G ±{-K Fi _ l (p i )),Ri'K G J r {'K Fi _ 1 {pi))). 
By the triangle inequality we have: 

d(K G ±(it Fi _ x (p i )),Ri'K G ±(-K F ._ 1 (pij)) < d{-K G ±(-K Fi _ x (j>i)),-K G ±(gi)) + d(ir G ±(gi),R i TT G J r (n Fi _ 1 (pi))). 
To bound the first term, d(TT G ±(ir Fi _ 1 (pi)),TT G j_(gi)), note that 

d(7r G j.(7TF i _ 1 (pi)),7r G j.(^i)) < d{-K Fi _ x {pi),gi). 
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We show that itFi-i(pi) is within a ball of radius R around Pi\ this implies 

d{-K Fi ^{Pi),gi)<5 (19) 

by construction of the <5-net around pi. We have: 

i-2 
d(pi,1TFi-i(pi)) < d(pi,F ) + ^2 d{^F 3 {pi)^F 3+1 {Pi)) 

i=i 

i-2 

< VCX + ^ d(TT Fj (pi),R j + 1 TT Fj (pi)) 

3 = 1 

< Va + 25k = R. 

The third line uses the induction hypothesis. 

Now we bound the second term, d(TT G ±(gi), Riir G ±(-KF i _ 1 (pi)))- Note that Riir G ±(-KF i _ 1 (pi))) 
is just a rescaling of ft G ±(gi) and that | |7t g _l (7Tj? 1 ._ 1 (p*)) 1 1 = \\Ri'^ G -i-('^F i - 1 (p*))\\i since rotation 
preserves norms. The bound on the first term implies that |K G ±(gi)|| > || 7r G ± ( 7r ^-i(^'*))ll ~~ ^> so 

d(ir G ±(gi),R i n a L(ir Fi _ 1 (pi))) < 5. (20) 



Combining (19) and (20), we have proved (U8b. D 



Now, we are ready to prove Theorem [4] which proves the correctness of the algorithm. 

Proof. Assume that the optimal solution is of value at most B. Let V\,...,Vj be the optimal 
subspaces, and let Pi, ... , Pj be the partition of P such that Pi is the set of points closest to V{. 
Let rii = | Pi |, with Y^i n i = n - By Theorem ^1 there exists a subset Si of Pi of size at most 8k 2 /e 
such that there is a subspace Wi in the span of Si with 

^%if 4 ) 2 <(i + ^%,y 4 ) 2 - (2i) 

P ePi P eP, 

Consider each subset Pi and its corresponding subspace Wi. Now, apply Lemma [8] to Pj and Wi 
using R, 5 as in the algorithm. Since the optimal solution is of value at most B, we have that, for 
every i: 

J>(p,v^) 2 <(i + £)z?. 

p&pi 

Let Ci, Fi be the set of k points and subspace, respectively, promised by the lemma. Pj is the span 
of k points from a <5-net of Q and obeys the following inequality: 

£ dip, Fl f < a + 1) E d & ^ 2 + h B+ h B - 

Let T = |L SiUCi. We have that |T| = 8jk 2 /e+jk. The algorithm will enumerate some set T' ^>T 
in Step [2J Furthermore, it will consider a partition {T{ . . . 71'} such that T/ D Si U Cj in Step [2a 
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Let D[ be the <5-net for T[ projected to the span of T[. Since the algorithm enumerates over 
all subspaces lying in the span of D' i: the algorithm will consider the subspaces Fi, . . . , Fj whose 
existence is proven above. The cost associated with F\, . . . , Fj is: 

caFi-Fj}) < JzY. d ^F? 

i=i P ep l 
< {l + e)B. 

The running time analysis follows by the following bounds. The number of subsets of size 
8jk 2 /e + jk is at most n 8 -?'* i e+ ^ k . The number of equipartitions of a set of size 8jk 2 /e + jk into j 
parts is at most j 8 - 7 / e +J fe . Recall that the <5-net D for a point set T of dimension d is implemented 
by putting a box with side length 2R of grid width 5 /yd around each point in T. Let X be the set 
of grid points in the box around a point p. The number of subspaces in each 5- net Di is therefore at 
most ((8fe 2 /e + k) |X|) , so the number of j subspaces that one can choose for a partition (T\ . . . Tj) 

is ((8fc 2 /e + k) \X\) . The computation of projecting points, finding a basis, and determining the 
cost of a candidate family of subspaces takes time 0(ndjk). The cardinality of X is (remember 
that e < 1): 

/ o o \ 8fc2 A+ fc / i— x 8k 2 /e+k 

I Art \ ./„/., jn y 



\ x \=\ ; < [O (jk 

\S/y/8k 2 /e + kJ -\V V < 

Therefore, the running time of the algorithm is at most 

0{ndjk) n *i#J*+* fi#/'+ik ((8k 2 /e + k) \X\) ]k = d f n ^° {! ''' ' 
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